10 August 2019

Today

Fraud

Definition

Fraud = scientific misconduct.

  • Falsifying or fabricating data.
  • This is intentional, not accidental.
  • Puts all science under a bad light.
  • Markedly different from QRPs (next).

Notable examples

Today we don’t talk about fraud explicitly.

We talk about something much harder to identify and eradicate:

Questionable research practices.

Questionable research practices

QRPs

Term coined by John, Loewenstein, & Prelec (2012).
See also Simmons, Nelson, & Simonsohn (2011).

  • Not necessarily fraud.
  • Includes the (ab)use of actually acceptable research practices.
  • Problem with QRPs:
    • Introduce bias (typically, in favor of the researcher’s intentions…).
    • Inflated power at the cost of inflated Type I error probability (\(\gg 5\%\)).
    • Results not replicable.

Example of QRPs

(John et al., 2012; Schimmack, 2015).

  • Omit some DVs.
  • Omit some conditions.
  • Peeking: Sequential testing — Look and decide:
    • \(p > .05\): Collect more.
    • \(p < .05\): Stop.
  • Only report \(p<.05\) results.
  • \(p\)-hacking: E.g.,
    • Exclusion of outliers depending on whether \(p<.05\).
    • \(p = .054 \longrightarrow p = .05\).
  • HARKing (Kerr, 1998): Hypothesizing After the Results are Known, i.e., presenting hypotheses derived from exploratory results as if they had been formulated in advance.
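
To see why peeking is a problem, here is a minimal simulation sketch in Python. It is an illustration only, not taken from the cited papers. It assumes normally distributed data with a known standard deviation of 1 (so a simple z-test suffices), no true effect, and hypothetical settings of 5 looks with 10 new participants per group at each look. All function names are made up for this example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_test_p(a, b):
    """Two-sample z-test p-value, assuming a known SD of 1 in both groups."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    return two_sided_p(diff / se)


def run_study(rng, looks, batch):
    """Simulate one study under H0; return True if any look gives p < .05."""
    a, b = [], []
    for _ in range(looks):
        # Collect another batch per group, then test again ("peek").
        a += [rng.gauss(0, 1) for _ in range(batch)]
        b += [rng.gauss(0, 1) for _ in range(batch)]
        if z_test_p(a, b) < 0.05:
            return True  # stop as soon as the result is "significant"
    return False


def false_positive_rate(looks, batch, n_sim=2000, seed=1):
    """Proportion of simulated null studies that end with p < .05."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, looks, batch) for _ in range(n_sim))
    return hits / n_sim


# Same maximum sample size (50 per group), with and without peeking.
fixed_rate = false_positive_rate(looks=1, batch=50)
peeking_rate = false_positive_rate(looks=5, batch=10)
print(f"Fixed n:  {fixed_rate:.3f}")
print(f"Peeking:  {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample design should stay close to the nominal 5%, whereas stopping at the first \(p<.05\) should push the false positive rate clearly above it.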

Researcher’s degrees of freedom

  • Researchers have a multitude of decisions to make (experiment design, data collection, analyses performed); Wicherts et al. (2016), Steegen, Tuerlinckx, Gelman, & Vanpaemel (2016).
  • It is very possible to manipulate results in favor of one’s interests.
  • This is now known as researcher’s degrees of freedom (Simmons et al., 2011).
  • Consequence: Inflated false positive findings (Ioannidis, 2005).
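
A back-of-the-envelope sketch of how quickly false positives accumulate when several analyses are tried and any significant one is reported. It assumes that all null hypotheses are true and that the analyses are independent; analyses run on the same data are usually correlated, so treat the numbers as a rough illustration rather than an exact figure. The function name is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) over independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 5, 10, 20):
    rate = chance_of_false_positive(n_tests)
    print(f"{n_tests:>2} analyses -> {rate:.3f}")
```

With 10 such analyses the chance of at least one false positive is already about 40%.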

Fried (2017)

  • The 7 most common depression scales contain 52 symptoms.
  • That’s 7 different sum scores.
  • Yet, all are interpreted as ‘level of depression’.

A now famous example…

Prof. Brian Wansink at Cornell University.

His description of the efforts of a visiting Ph.D. student:

I gave her a data set of a self-funded, failed study which had null results (…). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.

Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. I outlined the first paper, and she wrote it up (…). This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).

This isn’t being creative or thinking outside the box.

This is QRPing.

What happened to Wansink?

  • He was severely criticized, his work was scrutinized (e.g., van der Zee, Anaya, & Brown, 2017).
  • Over 100 (!!) errors in a set of four papers…
  • Has now 40 (!!) publications retracted (as of July 2019).
  • After a year-long internal investigation, he was forced to resign.

Is it really that bad?…

Yes.

  • Martinson, Anderson, & de Vries (2005): “Scientists behaving badly”.
  • Fanelli (2009): Meta-analysis shows evidence of scientific misconduct.
  • John et al. (2012): Evidence for QRPs in psychology.
  • Mobley, Linder, Braeuer, Ellis, & Zwelling (2013): Reported evidence of pressure to find significant results.
  • Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli (2017): Evidence of QRPs, now in Italy.
  • Fraser, Parker, Nakagawa, Barnett, & Fidler (2018): In other fields of science.

Interestingly, scientific misconduct has been a long-standing concern (see Babbage, 1830).

And for the sake of balance:
There are also some voices against this description of the current state of affairs (e.g., Fiedler & Schwarz, 2016).

But why?…

Why are QRPs so prevalent?

It is strongly related to incentives (Nosek, Spies, & Motyl, 2012; Schönbrodt, 2015).

  • “Publish or perish”:
    Publish a lot, in highly prestigious journals.
    • Journals only publish a fraction of all manuscripts.
    • Journals don’t like publishing null findings…
  • Get tenured.
  • Get research grants.
  • Fame (prizes, press coverage, …).



But, very importantly, it also happens in spite of the researcher’s best intentions.

  • Deficient statistics education (yes, statisticians need to acknowledge this!…).
  • Perpetuating traditions in the field.

(Ir)reproducibility

Threats to reproducible science

Munafò et al. (2017)

  • Hypothetico-deductive model of the scientific method.
  • In red: Potential threats to this model.

Lack of replications

Until very recently (Makel, Plucker, & Hegarty, 2012).

  • Very low rate of replications in Psychology (estimated ~1%).
  • Until 2012, the majority of replications were actually successful!!
  • However, in most cases both the original and replication studies shared authorship…
  • Conflict of interest?…

Famous replication failures

Didn’t we see this coming?

Meehl (1967)

How poorly we build theory (see Gelman):

“It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (…) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through (…) a long series of related experiments (…) without ever once refuting or corroborating so much as a single strand of the network.”

Say what?…

Cohen (1962)

Low-powered experiments:

“(…) It was found that the average power (probability of rejecting false null hypotheses) over the 70 research studies was .18 for small effects, .48 for medium effects, and .83 for large effects. These values are deemed to be far too small.”

“(…) it is recommended that investigators use larger sample sizes than they customarily do.”
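
A rough sketch of how power depends on effect size and sample size. It uses a normal approximation for a two-sided, two-sample comparison. The effect sizes \(d = 0.2, 0.5, 0.8\) are the commonly used "small/medium/large" conventions and are an assumption here (they need not match the definitions used in Cohen, 1962), and the 30 participants per group are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label:>6} effect (d = {d}): power = {approx_power(d, 30):.2f}")
```

Solving the same expression for `n_per_group` is the basis of an a priori power analysis; for small samples an exact calculation based on the noncentral \(t\) distribution would be more accurate than this approximation.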

Kahneman (2012), see here

Nobel prize winner, 2002.

About priming effects (but quite general remarks…):

“The storm of doubts is fed by (…) the recent exposure of fraudulent researchers, general concerns with replicability (…), multiple reported failures to replicate salient results (…), and the growing belief in the existence of a pervasive file drawer problem (…).”

“My reason for writing this letter is that I see a train wreck looming.”

“I believe that you should collectively do something about this mess.”

Timeline of a train wreck

  • Gelman blogged an impressive timeline of the replication crisis.
  • The whole blog post is worth reading for many reasons, including Gelman’s criticism over criticism!
    (versus Susan Fiske’s position).

\(p\)-values

Definition

Probability of observing a result at least as extreme as the one actually observed, given that \(\mathcal{H}_0\) is true.

\[\fbox{$ p\text{-value} = P\left(X_\text{obs} \text{ or more extreme}|\mathcal{H}_0\right) $}\]
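
A small worked example of this definition, as a sketch with made-up data: 16 heads in 20 coin flips, with \(\mathcal{H}_0\) stating that the coin is fair. Because every sequence of flips is equally likely under \(\mathcal{H}_0\), the \(p\)-value can be computed by counting outcomes. The function name and the data are hypothetical.

```python
from math import comb


def binomial_p_value(k_obs, n, two_sided=True):
    """Exact p-value for k_obs successes in n trials under H0: P(success) = 0.5."""
    distance = abs(k_obs - n / 2)  # how far the observation is from n/2
    if two_sided:
        # "At least as extreme" = at least as far from n/2, in either direction.
        extreme = [k for k in range(n + 1) if abs(k - n / 2) >= distance]
    else:
        extreme = [k for k in range(n + 1) if k >= k_obs]
    return sum(comb(n, k) for k in extreme) / 2 ** n


p_two_sided = binomial_p_value(16, 20)
p_one_sided = binomial_p_value(16, 20, two_sided=False)
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

The two-sided \(p\)-value here is about .012: if the coin were fair, a result at least this far from 10 heads would occur in roughly 1.2% of such experiments. Note that this is a statement about the data given \(\mathcal{H}_0\), not about the probability that \(\mathcal{H}_0\) is true.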

The definition is simple enough, right?…

Test yourself

Consider the following statement (Falk & Greenbaum, 1995; Gigerenzer, Krauss, & Vitouch, 2004; Haller & Krauss, 2002; Oakes, 1986):

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Furthermore, suppose you use a simple independent means \(t\)-test and your result is significant (\(t = 2.7\), \(df = 18\), \(p = .01\)). Please mark each of the statements below as “true” or “false.” False means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

Results

All statements are incorrect.

But how did students and teachers perceive these statements?

This was in 2004. But things have not improved since…

Goodman (2008)


Greenland et al. (2016)




This paper expands Goodman (2008) and elaborates on 25 misinterpretations.

The American Statistician (2019)

Special issue with 43 (!!) papers (Wasserstein, Schirm, & Lazar, 2019).

Moving to a world beyond “\(p<.05\)”

Confidence intervals

A better alternative?

  • Confidence intervals (CIs) have often been advocated as the best inferential alternative to NHST.
  • Recall, for example, the Wilkinson Task Force (Wilkinson & Task Force on Statistical Inference, 1999):

“(…) it is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual \(p\) value or, better still, a confidence interval.”

  • But are CIs really a better alternative?

Definition

See, for instance, Hoekstra, Morey, Rouder, & Wagenmakers (2014).

A (say) 95% CI is a numerical interval found through a procedure that, if repeated across a series of hypothetical data sets, leads to an interval covering the true parameter 95% of the time.

  • A CI indicates a property of the performance of the procedure used to compute it:
    How is the procedure expected to perform in the long run?
  • A CI for a parameter is constructed around the parameter’s estimate.
  • However, a CI does not (really not!) directly indicate a property of the parameter being estimated!
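
The long-run reading of a CI can be checked with a short simulation. This is a sketch under simplifying assumptions: normally distributed data with a known standard deviation (so the interval is the sample mean plus or minus \(z \cdot \sigma/\sqrt{n}\)), and hypothetical values for the true mean, standard deviation, and sample size.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_coverage(true_mean=100.0, sd=15.0, n=25, n_sim=4000, level=0.95, seed=2019):
    """Share of simulated experiments whose CI contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / sqrt(n)  # known-SD interval: same width every time
    covered = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = fmean(sample)
        # Each repetition yields a different interval; the parameter is fixed.
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            covered += 1
    return covered / n_sim


coverage = ci_coverage()
print(f"Proportion of intervals covering the true mean: {coverage:.3f}")
```

The proportion should come out close to .95. That number describes the procedure across repetitions; any single computed interval either contains the true mean or it does not.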

Confused?
So is the vast majority of social scientists…

Test yourself

From Hoekstra et al. (2014), mimicking the \(p\) value study by Gigerenzer et al. (2004).

Results

All statements are incorrect.

But how did students and teachers perceive these statements?

What would be correct, then?…

“If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean.”

How informative is this?!


Mental note:
Remember this when interpreting Bayesian credible intervals in part 2 of today’s workshop!


For completeness, not everyone agrees with the Hoekstra study (García-Pérez & Alcalá-Quintana, 2016; Miller & Ulrich, 2016; see also a reply by Morey et al., 2016).

Publication policies

Psychological Science (Eich, 2014)

Basic and Applied Social Psychology

“The Basic and Applied Social Psychology (BASP) (…) emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid (…). From now on, BASP is banning the NHSTP.”

Did it actually work? For a reflection, see Fricker, Burke, Han, & Woodall (2019).

Child and Adolescent Mental Health (Spreckelsen, 2018)

(…) I will encourage authors to provide replication syntax and data through public repositories. Moreover, I will encourage the journal to focus on a manuscript’s research design and the author’s justification thereof, rather than the results, with the aim of ensuring that transparent studies that explore a research question with equipoise, will be published.

The New England Journal of Medicine

Editorial (Harrington et al., 2019).

“(…) a requirement to replace \(p\) values with estimates of effects or association and 95% confidence intervals”

What do statistical associations advise?

Wilkinson Task Force 1999

Among many, many pieces of advice:

  • Do not focus on \(p\) values.
  • Report effect sizes.
  • Report power analyses.
  • Check model assumptions.

“Novice researchers err either by overgeneralizing their results or, equally unfortunately, by overparticularizing.”

ASA 2016 (Wasserstein & Lazar, 2016)

Six principles:

  1. \(p\)-values can indicate how incompatible the data are with a specified statistical model.
  2. \(p\)-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a \(p\)-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A \(p\)-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a \(p\)-value does not provide a good measure of evidence regarding a model or hypothesis.

ASA 2019 (Wasserstein et al., 2019)

This is an editorial of a special issue consisting of 43 (!!) papers.

Main ideas:

  • “Don’t” is not enough – Some advice on what to do is provided.
  • However… Don’t say “statistically significant” – Just don’t.

“(…) it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “\(p < 0.05\),” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.”

But:

“Despite the limitations of \(p\)-values (…), however, we are not recommending that the calculation and use of continuous \(p\)-values be discontinued. Where \(p\)-values are used, they should be reported as continuous quantities (e.g., \(p = 0.08\)). They should also be described in language stating what the value means in the scientific context.”

  • There is no unique “do”:

“What you will NOT find in this issue is one solution that majestically replaces the outsized role that statistical significance has come to play.”

  • Accept uncertainty (I cannot stress this enough!).
    Be thoughtful, open, and modest.

  • Editorial, educational, and other institutional practices will have to change.
    This includes: Journals, funding agencies, education, career system.

  • Value replicability, open materials and data, and reliable practices (which all take time) over “publish or perish”.

ASA 2019: Also advocates Bayesian statistics

What do experts advise?

Munafò et al. (2017)

Methods:

  • Protecting against cognitive biases
  • Improving methodological training
  • Implementing independent methodological support
  • Encouraging collaboration and team science

Munafò et al. (2017)

Large-scale replication projects

Many Labs (Klein et al., 2014)

Replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants.

See also Many Labs 2 (Klein et al., 2018), Many Labs 3 (Ebersole et al., 2016).

Open Science Collaboration (OSC, 2015)



A gazillion authors.

(For the sake of balance, and for an interesting rebuttal (!), see Gilbert, King, Pettigrew, & Wilson (2016)).

The Psychological Science Accelerator

Moshontz et al. (2018); 92 authors!

Education

Frank & Saxe (2012)

Chambers (2017b)

Button (2018)

Sarafoglou, Hoogeveen, Matzke, & Wagenmakers (2019)

Research Master course on open science practices. Materials freely available at OSF!

Kiers, Hoekstra, Tendeiro, & Van Ravenzwaaij (2019)

Key ideas

Registered reports (RRs)

Visit the Center for Open Science.


Prior to data collection (Chambers, 2013):

  • Decide hypotheses, methods, and analysis.
    (Eliminate several QRPs, e.g., \(p\)-hacking, publication bias by researchers and journals.)
  • Peer review of paper.
  • Conditional acceptance of paper!
  • Not only original studies, but also replications are of value!

Registered reports (RRs)

As of July 2019, 204 journals use Registered Reports.

And recently, quite notably, Nature:

  • January 2017: Editorial announcing RRs (Editorial Nature, 2017).
  • July 2019: Editorial announcing first two RRs (Editorial Nature, 2019):
    • He & Côté (2019)
    • Brannon, Carr, Jin, Josephs, & Gawronski (2019)

To learn more:

  • Chambers (2013): Inception of RRs at Cortex in 2013.
  • Read the APS statement.
  • Nosek & Lakens (2014): Special issue in Social Psychology in 2014, with examples.
  • Chambers, Feredoes, Muthukumaraswamy, & Etchells (2014): Includes useful FAQs.
  • Chambers (2017a): Slides at OSF.

Preregistration

Preregistration works (Kaplan & Irvin, 2015)

Replication studies

Brandt et al. (2014)

Concern in major journals

Lindsay (2015)

“(…) Replicability is not the only criterion of a first-rate science journal, but it had better be a fundamental one.”

“My emphasis here is on experiments and NHST, (…) (By the way, I am enthusiastically open to submissions that make appropriate use of alternatives to NHST [read: Bayes].)”

In Nature

In Nature (Camerer et al., 2018)

“The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications.”

“The replications are high powered, with sample sizes on average about five times higher than in the original studies.”

“We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size.”

Conclusion:
Results published in high rank journals should be considered with care until they are replicated.

The PRO initiative (Morey, Chambers, et al., 2016)

Main goals:

  1. Data should be made publicly available.
  2. Stimuli and materials should be made publicly available.
  3. In case some data or materials are not open, clear reasons (e.g., legal, ethical constraints, or severe impracticality) should be given why.
  4. Documents containing details for interpreting any files or code, and how to compile and run any software programs should be made available with the above items.
  5. The location of all of these files should be advertised in the manuscript, and all files should be hosted by a reliable third party.

‘statcheck’

R package that can assist in detecting statistical reporting errors (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016).
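
The basic idea behind such a check can be sketched in a few lines of Python. This is not statcheck's implementation or API (statcheck is an R package and handles several kinds of test statistics). It is a hypothetical, simplified checker that only understands reports of the form `z = ..., p = ...`, recomputes the two-sided \(p\)-value, and compares it with the reported one at the reported number of decimals.

```python
import re
from math import erfc, sqrt

# Matches e.g. "z = 2.10, p = .036" (hypothetical, simplified reporting format).
PATTERN = re.compile(r"z\s*=\s*(-?\d+(?:\.\d+)?)\s*,\s*p\s*=\s*(0?\.\d+)")


def check_report(text):
    """Return (reported_p, recomputed_p, consistent) for each z-test found."""
    results = []
    for z_str, p_str in PATTERN.findall(text):
        decimals = len(p_str.split(".")[1])  # precision used in the report
        reported = float(p_str)
        recomputed = erfc(abs(float(z_str)) / sqrt(2))  # two-sided p for z
        consistent = abs(round(recomputed, decimals) - reported) < 1e-9
        results.append((reported, recomputed, consistent))
    return results


example = (
    "Group A differed from group B (z = 2.10, p = .036); "
    "there was no order effect (z = 1.50, p = .04)."
)
results = check_report(example)
for reported, recomputed, consistent in results:
    status = "ok" if consistent else "INCONSISTENT"
    print(f"reported p = {reported}, recomputed p = {recomputed:.3f} -> {status}")
```

In this made-up example the first result is consistent, while the second is flagged because \(z = 1.50\) corresponds to a two-sided \(p\) of about .13, not .04.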

What to avoid

Bullying

  • Debates in blogs, Twitter, and journals can be fierce.
  • Criticism should be part of science, of course.
  • Criticizing is not bullying, particularly when the criticism is well grounded (vide Wansink).
  • But sometimes criticism gets too carried away, IMHO.

NYT, 2017

(Interestingly: A recent comeback in Psychological Science.)

Self-appointed police

Most likely, each of us has some skeletons in their scientific closet.

We’ve all fallen prey to one or more of the problems mentioned today.

Full disclosure:

I have too!!

So:

No one is better than anyone.

Or in the words of Brian Nosek (as quoted here):

“We’re not here to be right. We’re here to get it right.”

No time today for…

  • Replication projects
  • Registered reports
  • Preregistrations
  • Education

(But we can talk about it too!…)


Today we focus on statistics.

Bayesian statistics

Alternative approach to statistical inference

After the break:

Gentle introduction to Bayesian statistics.

References

Agnoli, F., Wicherts, J. M., Veldkamp, C. L. S., Albiero, P., & Cubelli, R. (2017). Questionable research practices among Italian research psychologists. PLOS ONE, 12(3), e0172792. doi: 10.1371/journal.pone.0172792

Babbage, C. (1830). Reflections on the Decline of Science in England: And on Some of Its Causes. Retrieved from http://www.gutenberg.org/files/1216/1216-h/1216-h.htm

Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., … van ’t Veer, A. (2014). The Replication Recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224. doi: 10.1016/j.jesp.2013.10.005

Brannon, S. M., Carr, S., Jin, E. S., Josephs, R. A., & Gawronski, B. (2019). Exogenous testosterone increases sensitivity to moral norms in moral dilemma judgements. Nature Human Behaviour. doi: 10.1038/s41562-019-0641-3

Button, K. (2018). Reboot undergraduate courses for reproducibility. Nature, 561, 287. doi: 10.1038/d41586-018-06692-8

Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637. doi: 10.1038/s41562-018-0399-z

Chambers, C. (2013). Registered reports: A new publishing initiative at Cortex. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior, 49(3), 609–610. doi: 10.1016/j.cortex.2012.12.016

Chambers, C. (2017a). Talks.

Chambers, C. (2017b). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. doi: 10.1515/9781400884940

Chambers, C., Feredoes, E., Muthukumaraswamy, S. D., & Etchells, P. (2014). Instead of "playing the game" it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience, 1, 4–17. Retrieved from https://www.aimspress.com/article/10.3934/Neuroscience.2014.1.4

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145–153. doi: 10.1037/h0045186

Cuddy, A. J. C., Schultz, S. J., & Fosse, N. E. (2018). P-Curving a More Comprehensive Body of Research on Postural Feedback Reveals Clear Evidential Value for Power-Posing Effects: Reply to Simmons and Simonsohn (2017). Psychological Science. doi: 10.1177/0956797617746749

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. doi: 10.1016/j.jesp.2015.10.012

Editorial Nature. (2017). Promoting reproducibility with registered reports. Nature Human Behaviour, 1(1). doi: 10.1038/s41562-016-0034

Editorial Nature. (2019). What science looks like. Nature Human Behaviour. doi: 10.1038/s41562-019-0652-0

Eich, E. (2014). Business Not as Usual. Psychological Science, 25(1), 3–6. doi: 10.1177/0956797613512465

Falk, R., & Greenbaum, C. (1995). Significance Tests Die Hard - the Amazing Persistence of a Probabilistic Misconception. Theory & Psychology, 5(1), 75–98. doi: 10.1177/0959354395051004

Fanelli, D. (2009). How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data. PLOS ONE, 4(5), e5738. doi: 10.1371/journal.pone.0005738

Fiedler, K., & Schwarz, N. (2016). Questionable Research Practices Revisited. Social Psychological and Personality Science, 7(1), 45–52. doi: 10.1177/1948550615612150

Flore, P. C., Mulder, J., & Wicherts, J. M. (2019). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Comprehensive Results in Social Psychology, 1–35. doi: 10.1080/23743603.2018.1559647

Frank, M. C., & Saxe, R. (2012). Teaching Replication. Perspectives on Psychological Science, 7(6), 600–604. doi: 10.1177/1745691612460686

Fraser, H., Parker, T., Nakagawa, S., Barnett, A., & Fidler, F. (2018). Questionable research practices in ecology and evolution. PLOS ONE, 13(7), e0200303. doi: 10.1371/journal.pone.0200303

Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. doi: 10.1080/00031305.2018.1537892

Fried, E. I. (2017). The 52 symptoms of major depression: Lack of content overlap among seven common depression scales. Journal of Affective Disorders, 208, 191–197. doi: 10.1016/j.jad.2016.10.019

Friese, M., Loschelder, D. D., Gieseler, K., Frankenbach, J., & Inzlicht, M. (2019). Is Ego Depletion Real? An Analysis of Arguments. Personality and Social Psychology Review, 23(2), 107–131. doi: 10.1177/1088868318762183

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the Past: Failures to Replicate Psi (SSRN Scholarly Paper No. ID 2001721). Retrieved from Social Science Research Network website: https://papers.ssrn.com/abstract=2001721

García-Pérez, M. A., & Alcalá-Quintana, R. (2016). The Interpretation of Scholars’ Interpretations of Confidence Intervals: Criticism, Replication, and Extension of Hoekstra et al. (2014). Frontiers in Psychology, 7. doi: 10.3389/fpsyg.2016.01042

Gendron, M., Crivelli, C., & Barrett, L. F. (2018). Universality Reconsidered: Diversity in Making Meaning of Facial Expressions. Current Directions in Psychological Science, 27(4), 211–219. doi: 10.1177/0963721417746794

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. Retrieved from https://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science”. Science, 351(6277), 1037–1037. doi: 10.1126/science.aad7243

Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3), 135–140. doi: 10.1053/j.seminhematol.2008.04.003

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. doi: 10.1007/s10654-016-0149-3

Griggs, R. A. (2014). Coverage of the Stanford Prison Experiment in Introductory Psychology Textbooks. Teaching of Psychology, 41(3), 195–203. doi: 10.1177/0098628314537968

Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A Multilab Preregistered Replication of the Ego-Depletion Effect. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 11(4), 546–573. doi: 10.1177/1745691616652873

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7(1), 1–20.

Harrington, D., D’Agostino, R. B., Gatsonis, C., Hogan, J. W., Hunter, D. J., Normand, S.-L. T., … Hamel, M. B. (2019). New Guidelines for Statistical Reporting in the Journal. New England Journal of Medicine, 381(3), 285–286. doi: 10.1056/NEJMe1906559

He, J. C., & Côté, S. (2019). Self-insight into emotional and cognitive abilities is not related to higher adjustment. Nature Human Behaviour. doi: 10.1038/s41562-019-0644-0

Heathers, J. (2018). Alright, let’s have a roll-call of the big psychology studied that ate their own teeth for one reason or another. SOCIAL PRIMING. Lots of failed repos. http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html … [Tweet]. Retrieved from https://twitter.com/jamesheathers/status/1006287906087071748

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164. doi: 10.3758/s13423-013-0572-3

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLOS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. doi: 10.1177/0956797611430953

Kaplan, R. M., & Irvin, V. L. (2015). Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time. PloS One, 10(8), e0132382. doi: 10.1371/journal.pone.0132382

Kerr, N. L. (1998). HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review, 2(3), 196–217. doi: 10.1207/s15327957pspr0203_4

Kiers, H., Hoekstra, R., Tendeiro, J., & Van Ravenzwaaij, D. (2019). Unconf - Implications of teaching Bayesian statistics to undergraduate psychology students. Retrieved from https://osf.io/tnuex/

Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating Variation in Replicability. Social Psychology, 45(3), 142–152. doi: 10.1027/1864-9335/a000178

Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. doi: 10.1177/2515245918810225

Lindsay, D. S. (2015). Replication in psychological science. Psychological Science, 26(12), 1827–1832. doi: 10.1177/0956797615616374

Maes, E., Boddez, Y., Alfei, J. M., Krypotos, A.-M., D’Hooge, R., De Houwer, J., & Beckers, T. (2016). The elusive nature of the blocking effect: 15 failures to replicate. Journal of Experimental Psychology. General, 145(9), e49–71. doi: 10.1037/xge0000200

Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in Psychology Research: How Often Do They Really Occur? Perspectives on Psychological Science, 7(6), 537–542. doi: 10.1177/1745691612460688

Martinson, B. C., Anderson, M. S., & de Vries, R. (2005). Scientists behaving badly. Nature, 435(7043), 737. doi: 10.1038/435737a

McKelvie, P., & Low, J. (2002). Listening to Mozart does not improve children’s spatial ability: Final curtains for the Mozart effect. British Journal of Developmental Psychology, 20(2), 241–258. doi: 10.1348/026151002166433

Meehl, P. E. (1967). Theory-Testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science, 34(2), 103–115. Retrieved from http://www.jstor.org/stable/186099

Miller, J., & Ulrich, R. (2016). Interpreting confidence intervals: A comment on Hoekstra, Morey, Rouder, and Wagenmakers (2014). Psychonomic Bulletin & Review, 23(1), 124–130. doi: 10.3758/s13423-015-0859-7

Mobley, A., Linder, S. K., Braeuer, R., Ellis, L. M., & Zwelling, L. (2013). A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. PLOS ONE, 8(5), e63221. doi: 10.1371/journal.pone.0063221

Morey, R. D., Chambers, C. D., Etchells, P., Harris, C. R., Hoekstra, R., Lakens, D., … Zwaan, R. A. (2016). The Peer Reviewers’ Openness Initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. doi: 10.1098/rsos.150547

Morey, R. D., Hoekstra, R., Rouder, J. N., & Wagenmakers, E.-J. (2016). Continued misinterpretation of confidence intervals: Response to Miller and Ulrich. Psychonomic Bulletin & Review, 23(1), 131–140. doi: 10.3758/s13423-015-0955-8

Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., … Chartier, C. R. (2018). The Psychological Science Accelerator: Advancing Psychology Through a Distributed Collaborative Network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515. doi: 10.1177/2515245918797607

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C., Percie du Sert, N., … Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021. doi: 10.1038/s41562-016-0021

Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45(3), 137–141. doi: 10.1027/1864-9335/a000192

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability. Perspectives on Psychological Science, 7(6), 615–631. doi: 10.1177/1745691612459058

Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. doi: 10.3758/s13428-015-0664-2

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester: John Wiley & Sons.

Oostenbroek, J., Suddendorf, T., Nielsen, M., Redshaw, J., Kennedy-Costantini, S., Davis, J., … Slaughter, V. (2016). Comprehensive Longitudinal Study Challenges the Existence of Neonatal Imitation in Humans. Current Biology, 26(10), 1334–1338. doi: 10.1016/j.cub.2016.03.047

OSC. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi: 10.1126/science.aac4716

Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women. Psychological Science, 26(5), 653–656. doi: 10.1177/0956797614553946

Reicher, S., & Haslam, S. A. (2006). Rethinking the psychology of tyranny: The BBC prison study. British Journal of Social Psychology, 45(1), 1–40. doi: 10.1348/014466605X48998

Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the Future: Three Unsuccessful Attempts to Replicate Bem’s “Retroactive Facilitation of Recall” Effect. PLoS ONE, 7(3), e33423. doi: 10.1371/journal.pone.0033423

Sarafoglou, A., Hoogeveen, S., Matzke, D., & Wagenmakers, E.-J. (2019). Teaching Good Research Practices: Protocol of a Research Master Course. Psychology Learning & Teaching, 1475725719858807. doi: 10.1177/1475725719858807

Schimmack, U. (2015). Questionable Research Practices: Definition, Detect, and Recommendations for Better Practices. Retrieved from https://replicationindex.com/2015/01/24/questionable-research-practices-definition-detect-and-recommendations-for-better-practices/

Schönbrodt, F. (2015). Questionable Research Practices. Retrieved from https://osf.io/bh7zv/

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. doi: 10.1177/0956797611417632

Spreckelsen, T. F. (2018). Editorial: Changes in the field: Banning p-values (or not), transparency, and the opportunities of a renewed discussion on rigorous (quantitative) research. Child and Adolescent Mental Health, 23(2), 61–62. doi: 10.1111/camh.12277

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science, 11(5), 702–712. doi: 10.1177/1745691616658637

Steele, K. M., Bass, K. E., & Crook, M. D. (1999). The Mystery of the Mozart Effect: Failure to Replicate. Psychological Science, 10(4), 366–369. doi: 10.1111/1467-9280.00169

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1–2. doi: 10.1080/01973533.2015.1012991

Vadillo, M. A., Gold, N., & Osman, M. (2018). Searching for the bottom of the ego well: Failure to uncover ego depletion in Many Labs 3. Royal Society Open Science, 5(8), 180390. doi: 10.1098/rsos.180390

van der Zee, T., Anaya, J., & Brown, N. J. L. (2017). Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition, 3(1), 54. doi: 10.1186/s40795-017-0167-x

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. doi: 10.1177/1745691616674458

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133. doi: 10.1080/00031305.2016.1154108

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p \(<\) 0.05”. The American Statistician, 73(sup1), 1–19. doi: 10.1080/00031305.2019.1583913

Watts, T. W., Duncan, G. J., & Quan, H. (2018). Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes. Psychological Science, 29(7), 1159–1177. doi: 10.1177/0956797618761661

Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in Psychology, 7. doi: 10.3389/fpsyg.2016.01832

Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604. doi: 10.1037/0003-066X.54.8.594